Identifying and Tagging Titles in Web Texts
Abstract
In this paper, we present an analysis based on linguistic and typographic features that allows for the identification of titles in web documents. We focus in particular on procedural texts. Identifying titles is a difficult task because the ways of encoding them are very diverse. A number of titles are also incomplete because of context; we therefore also propose a way to retrieve the missing elements, in particular predicates, so that titles are fully intelligible.
Similar resources
Toponym recognition in custom-made map titles
The titles of customized topographic maps constitute a specific corpus which is characterized by a very significant number of place names and spelling variations. This paper is about identifying toponyms in these titles. The toponym tracking is based on gazetteers as well as light parsing according to patterns. The method used broadens the definition of the toponym to include the nature of the ...
A POS Tagger for Social Media Texts Trained on Web Comments
Using social media tools such as blogs and forums has become more and more popular in recent years. Hence, a huge collection of social media texts from different communities is available for accessing user opinions, e.g., for marketing studies or acceptance research. Typically, methods from Natural Language Processing are applied to social media texts to automatically recognize user opinions. ...
Web Mining for an Amharic - English Bilingual Corpus
We present recent work aimed at constructing a bilingual corpus consisting of comparable Amharic and English news texts. The Amharic and English texts were collected from an Ethiopian news agency that publishes daily news in Amharic and English through their web page. The Amharic texts are represented using Ethiopic script and archived according to the Ethiopian calendar. The overlap between th...
Multi-value Classification of Very Short Texts
We introduce a new stacking-like approach for multi-value classification. We apply this classification scheme using Naive Bayes, Rocchio and kNN classifiers on the well-known Reuters dataset. We use part-of-speech tagging for stopword removal. We show that our setup performs almost as well as other approaches that use the full article text even though we only classify headlines. Finally, we app...
Adaptive Model for Integrating Different Types of Associated Texts for Automated Annotation of Web Images
Many texts are associated with Web images, such as the image file name, ALT text, and surrounding text on the corresponding Web pages. It is well known that the semantics of Web images are well correlated with these associated texts, and thus they can be used to infer the semantics of Web images. However, different types of associated texts may play different roles in deriving the semantics o...